Abstract

Regression with $\chi^2$ constructed from the covariance matrix should not be used for
some combinations of covariance matrices and fitting functions. Using the technique for
unsuitable combinations can amplify systematic errors. This amplification is
uncontrolled, and can produce arbitrarily inaccurate results that might not be ruled
out by a $\chi^2$ test. In addition, this technique can give incorrect (artificially small) errors
for fit parameters. I give a test for this instability and a more robust (but
computationally more intensive) method for fitting correlated data.

Recently there has been some interest in the analysis of correlated data, and people
seeking more sophisticated analysis techniques have often performed regression, using
the covariance matrix to construct $\chi^2$ [1]. DeGrand [2], DeTar and Kogut [3] and
Gottlieb et al. [4] use the technique to analyze lattice gauge theory results, while
Abreu et al. [5] and Wosiek [6] use it to analyze scaled factorial moment data. I show
here that this analysis technique can amplify systematic errors, unlike simpler, more
robust techniques.

This technique is simple in principle - transform the data to an uncorrelated basis, 
use regression to fit the data in this basis, then transform back to the laboratory frame. 
However, some of the results obtained by this procedure are very odd. In particular, 
Gottlieb et al. [4], Toussaint [7] and Wosiek [6] find that this procedure can produce 
best-fit lines that fall below all data points, and even below all error bars! 

In this paper, I first discuss the proposed treatment of correlated data, and show 
that in a gedanken experiment without systematic errors this treatment produces 
exactly the desired results. In a very similar gedanken experiment with arbitrarily 
small systematic errors, this procedure amplifies the errors in the data; therefore, this 
treatment of data is not robust. I use these simple gedanken experiments instead of the 
scaled factorial moment data or the lattice gauge theory results for purposes of 
presentation, as the effect observed in the different data sets is qualitatively the same. I 
then give a more robust alternative procedure for fitting correlated data, and in the 
course of this discussion a test for the stability of the regression is shown. 

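The experimental procedure is very simple. Consider $N$ trial measurements of $I$ data
points $y_i$. Calculate the covariance matrix from these data,

$$C_{ij} = \frac{1}{N(N-1)} \sum_{n=1}^{N} (y_{i,n} - \bar{y}_i)(y_{j,n} - \bar{y}_j), \tag{1}$$

where

$$\bar{y}_i = \frac{1}{N} \sum_{n=1}^{N} y_{i,n}. \tag{2}$$

Fit the data with the curve $f_i(\{a\})$, where $\{a\}$ is the set of free parameters, by
minimizing

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{I} (\bar{y}_i - f_i)\,(C^{-1})_{ij}\,(\bar{y}_j - f_j). \tag{3}$$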
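The covariance matrix and $\chi^2$ defined above can be sketched numerically as follows. This is a minimal illustration, not part of the original analysis: the function names `covariance_of_means` and `chi_squared` are hypothetical, and the normalization $1/[N(N-1)]$ (the covariance of the sample means) is an assumption, chosen to agree with the $1/(N-1)$ factors in the errors quoted later.

```python
import numpy as np

def covariance_of_means(samples):
    """Covariance matrix of the sample means.

    samples: array of shape (N, I), one row per trial, one column per data point.
    Assumed normalization: 1 / (N (N - 1)).
    """
    samples = np.asarray(samples, dtype=float)
    n_trials = samples.shape[0]
    dev = samples - samples.mean(axis=0)          # y_{i,n} - ybar_i
    return dev.T @ dev / (n_trials * (n_trials - 1))

def chi_squared(y_mean, f, cov):
    """chi^2 = sum_ij (ybar_i - f_i) (C^-1)_ij (ybar_j - f_j)."""
    r = np.asarray(y_mean, dtype=float) - np.asarray(f, dtype=float)
    # Solve C x = r instead of forming the inverse explicitly.
    return float(r @ np.linalg.solve(cov, r))

# Made-up data: N = 3 trials of I = 2 data points.
samples = np.array([[1.0, 0.0], [3.0, 0.0], [2.0, 3.0]])
cov = covariance_of_means(samples)
chi2 = chi_squared(samples.mean(axis=0), [1.0, 1.0], cov)
print(cov)
print(chi2)
```

Solving the linear system rather than inverting $C$ is a design choice that tends to behave better when $C$ is nearly singular, which is exactly the regime where strongly correlated data live.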

I illustrate this procedure with a gedanken experiment to measure the mean voltage
of a generator that produces random voltages $v$ with probability distribution $p(v)$. The
generator charges a capacitor, and I then measure the voltage on the capacitor twice,
calling the two measurements $y_1$ and $y_2$. Each measurement has some (uncorrelated)
"measurement noise" in addition to the fluctuations due to the random voltage
generator; I assume that this noise may be different for the two measurements.

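After $N$ trials, the experimentally determined covariance matrix is

$$C = \frac{1}{N-1} \begin{pmatrix} \sigma^2 + \epsilon_1^2 & \sigma^2 \\ \sigma^2 & \sigma^2 + \epsilon_2^2 \end{pmatrix}, \tag{4}$$

where

$$\sigma^2 = \langle v^2 \rangle - \langle v \rangle^2 \tag{5}$$

is the contribution to the covariance matrix from the distribution of random voltages,
and $\epsilon_{1(2)}^2$ is the contribution of noise from the set of first (second) measurements.
Fitting to the function $f_1 = f_2 = V$, $\chi^2$ is minimized when

$$V = \frac{\epsilon_1^{-2} \langle y_1 \rangle + \epsilon_2^{-2} \langle y_2 \rangle}{\epsilon_1^{-2} + \epsilon_2^{-2}}, \tag{6}$$

where $\langle y_{1(2)} \rangle$ is the average value for the first (second) measurement. It is clear from
Eq. (6) that $V$ is the average of $\langle y_1 \rangle$ and $\langle y_2 \rangle$, properly weighted for measurement
error, and so the analysis procedure is very successful at fitting the curve for this
gedanken experiment.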

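The experimental error is given by

$$\sigma_V^{-2} = \frac{1}{2} \frac{\partial^2 \chi^2}{\partial V^2}. \tag{7}$$

Taking $\chi^2$ from Eq. (3) and $C$ from Eq. (4) gives

$$\sigma_V^2 = \frac{\sigma^2 + 1/(\epsilon_1^{-2} + \epsilon_2^{-2})}{N-1}. \tag{8}$$

Again, this technique works well, clearly giving the correct error in the cases $\sigma = 0$ and
$\sigma \to \infty$.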
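The results of Eqs. (6) and (8) can be checked numerically with a short sketch. The helper `fit_constant` is a hypothetical name, and the numerical values of $\sigma^2$, the noise amplitudes, $N$, and the two averages are arbitrary assumptions chosen only for illustration; the covariance matrix is assumed to carry an overall factor $1/(N-1)$.

```python
import numpy as np

def fit_constant(y, cov):
    """Minimize chi^2 for f_i = V; return (V, variance of V)."""
    w = np.linalg.solve(cov, np.ones(len(y)))   # w_i = sum_j (C^-1)_ij (C is symmetric)
    var_v = 1.0 / w.sum()                       # from sigma_V^-2 = (1/2) d^2 chi^2 / dV^2
    return float(w @ y) * var_v, float(var_v)

# Assumed illustrative values: sigma^2 = 4, eps_1 = 1, eps_2 = 2, N = 101.
sigma2, eps1, eps2, n_trials = 4.0, 1.0, 2.0, 101
cov = np.array([[sigma2 + eps1**2, sigma2],
                [sigma2, sigma2 + eps2**2]]) / (n_trials - 1)
y = np.array([10.0, 11.0])                      # assumed averages <y_1>, <y_2>

V, var_V = fit_constant(y, cov)

# Closed forms: inverse-noise-weighted average and its variance.
V_expected = (y[0] / eps1**2 + y[1] / eps2**2) / (eps1**-2 + eps2**-2)
var_expected = (sigma2 + 1.0 / (eps1**-2 + eps2**-2)) / (n_trials - 1)
print(V, V_expected)
print(var_V, var_expected)
```

Note that the common fluctuation $\sigma^2$ drops out of the fitted value and enters only the variance, as expected when both measurements see the same random voltage.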

Now, I modify the gedanken experiment slightly, by assuming that the capacitor 
discharges somewhat between the two measurements. I therefore assume that the first 
measurement is unchanged, but that the voltage is reduced by a factor $\gamma$ at the time of
the second measurement. I could alternatively assume that the scales of the voltmeters 
are slightly different, but I wish to have all systematic effects occur before 
measurement rather than during it. 

In this case, the experimentally determined covariance matrix is

$$C = \frac{1}{N-1} \begin{pmatrix} \sigma^2 + \epsilon_1^2 & \gamma \sigma^2 \\ \gamma \sigma^2 & \gamma^2 \sigma^2 + \epsilon_2^2 \end{pmatrix}. \tag{9}$$

After an infinite number of measurements $\langle y_1 \rangle = \bar{v}$ and $\langle y_2 \rangle = \gamma \bar{v}$, where $\bar{v}$ is the mean
value of the random voltage. Fitting again to the function $f_1 = f_2 = V$, $\chi^2$ is
minimized when

$$V = \bar{v}\, \frac{\gamma \epsilon_1^2 + \epsilon_2^2}{(\gamma - 1)^2 \sigma^2 + \epsilon_1^2 + \epsilon_2^2}. \tag{10}$$

The procedure gives systematic errors for this second experiment, as $V$ is always less
than $\bar{v}$ for $\gamma < 1$. This is not totally unexpected, because the discharge of the capacitor between
the measurements gives a systematically lower value of $y_2$. If $\gamma = 1$, as in the first
experiment, all systematic errors vanish. For any other value of $\gamma$, however, $V$ can have
any value between zero and $\bar{v}$. By contrast, a naive least-squares fit always yields a
value for $V$ between $\gamma \bar{v}$ and $\bar{v}$. Thus, the covariance matrix technique can produce
large systematic errors from arbitrarily small intrinsic systematic errors.
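The amplification can be made concrete with a small numerical sketch. All parameter values below are assumptions chosen for illustration (a 1% discharge, large voltage fluctuations, small measurement noise), and the closed form used for comparison is the one obtained by minimizing $\chi^2$ analytically for a covariance matrix with entries $\sigma^2 + \epsilon_1^2$, $\gamma\sigma^2$, and $\gamma^2\sigma^2 + \epsilon_2^2$, up to an overall $1/(N-1)$.

```python
import numpy as np

def chi2_constant_fit(y, cov):
    """Value of V that minimizes chi^2 for the fit f_i = V."""
    w = np.linalg.solve(cov, np.ones(len(y)))   # w_i = sum_j (C^-1)_ij
    return float(w @ y / w.sum())

# Assumed values: 1% discharge, sigma = 10, eps_1 = eps_2 = 0.001, vbar = 1, N = 11.
gamma, sigma, eps, v_mean, n_trials = 0.99, 10.0, 0.001, 1.0, 11
cov = np.array([[sigma**2 + eps**2, gamma * sigma**2],
                [gamma * sigma**2, gamma**2 * sigma**2 + eps**2]]) / (n_trials - 1)
y = np.array([v_mean, gamma * v_mean])          # <y_1> = vbar, <y_2> = gamma * vbar

V_cov = chi2_constant_fit(y, cov)               # covariance-matrix regression
V_naive = float(y.mean())                       # unweighted least-squares fit to a constant
V_formula = v_mean * (gamma * eps**2 + eps**2) / ((gamma - 1)**2 * sigma**2 + 2 * eps**2)

print("covariance-matrix fit:", V_cov)
print("closed form          :", V_formula)
print("naive fit            :", V_naive)
```

With these assumed numbers the naive fit stays between $\gamma\bar{v}$ and $\bar{v}$, while the covariance-matrix fit falls far below both data points.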

One might think that $\chi^2$ should be large whenever the fit is very bad ($V \ll \bar{v}$).
However, this is not the case if the sample size is too small. In the limit $\sigma \to \infty$, where
the fit is the worst,

$$\chi^2 = (N-1)\, \frac{\bar{v}^2}{\sigma^2}. \tag{11}$$

Thus, $\chi^2$ will be acceptably small whenever $N < (\sigma/\bar{v})^2$, so that an infinite number of
events may be required to rule out the worst fits.

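One might then expect that, if $\chi^2$ is acceptably small, the error in $V$ will be large
enough that $V$ is within a few standard deviations of $\bar{v}$. However, in the limit
$|\gamma - 1|\,\sigma \gg \epsilon_1, \epsilon_2$,

$$\sigma_V^2 = \frac{\sigma^2 (\gamma^2 \epsilon_1^2 + \epsilon_2^2) + \epsilon_1^2 \epsilon_2^2}{(N-1)\left[(\gamma - 1)^2 \sigma^2 + \epsilon_1^2 + \epsilon_2^2\right]} \approx \frac{\gamma^2 \epsilon_1^2 + \epsilon_2^2}{(N-1)(\gamma - 1)^2}. \tag{12}$$

Thus, it is quite possible to have simultaneously $V \ll \bar{v}$, $\chi^2$ small, and $(V - \bar{v})^2 \gg \sigma_V^2$.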
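A short numerical sketch shows the three conditions holding at once. The parameter values are assumptions picked for illustration only (1% discharge, $\sigma = 10$, noise $0.001$, $\bar{v} = 1$, $N = 11$), and the covariance matrix is assumed to carry an overall factor $1/(N-1)$.

```python
import numpy as np

# Assumed illustrative parameters.
gamma, sigma, eps, v_mean, n_trials = 0.99, 10.0, 0.001, 1.0, 11
cov = np.array([[sigma**2 + eps**2, gamma * sigma**2],
                [gamma * sigma**2, gamma**2 * sigma**2 + eps**2]]) / (n_trials - 1)
y = np.array([v_mean, gamma * v_mean])           # averages of the two measurements

w = np.linalg.solve(cov, np.ones(2))             # w_i = sum_j (C^-1)_ij
var_V = float(1.0 / w.sum())                     # apparent variance of the fitted V
V = float(w @ y) * var_V                         # chi^2-minimizing constant

r = y - V                                        # residuals at the minimum
chi2_min = float(r @ np.linalg.solve(cov, r))    # chi^2 of the best fit
pull = abs(V - v_mean) / np.sqrt(var_V)          # |V - vbar| in units of sigma_V

print("V        =", V)
print("chi^2    =", chi2_min)
print("sigma_V  =", np.sqrt(var_V))
print("pull     =", pull)
```

For these assumed values the fit has $\chi^2$ well below one for a single degree of freedom, yet the fitted value lies many apparent standard deviations away from $\bar{v}$.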

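Now I try a more robust technique, constructing the best estimator by minimizing
the variance in

$$y = a y_1 + (1 - a) y_2. \tag{13}$$

The variance is

$$\sigma_y^2 = \langle y^2 \rangle - \langle y \rangle^2 \tag{14}$$

$$= a^2 \left( \langle y_1^2 \rangle - \langle y_1 \rangle^2 \right) + 2a(1 - a) \left( \langle y_1 y_2 \rangle - \langle y_1 \rangle \langle y_2 \rangle \right) + (1 - a)^2 \left( \langle y_2^2 \rangle - \langle y_2 \rangle^2 \right). \tag{15}$$

The condition $d\sigma_y/da = 0$ then gives

$$a = \frac{\gamma(\gamma - 1)\sigma^2 + \epsilon_2^2}{(\gamma - 1)^2 \sigma^2 + \epsilon_1^2 + \epsilon_2^2}.$$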
The value of $V$ is the same for the two techniques, and $\sigma_y$ is the same when $\gamma = 1$.
In the limit $\epsilon_1, \epsilon_2 \to 0$ I find

$$\sigma_y^2 = \frac{\gamma^2 \epsilon_1^2 + \epsilon_2^2}{(N-1)(\gamma - 1)^2},$$

which is identical to the result obtained from $\chi^2$ regression. Thus, the techniques are
almost the same. However, the best estimator technique is more transparent, and the
cause of the instability is more easily recognized and corrected with this technique.
In the previous analysis I left out a condition: $a$ and $(1 - a)$ must both be
non-negative. In this case, the unconstrained solution for $a$ is only valid when

$$\epsilon_1^2 \geq (\gamma - 1)\sigma^2, \tag{21}$$

$$\epsilon_2^2 \geq -\gamma(\gamma - 1)\sigma^2. \tag{22}$$

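Applying this condition, $\sigma_y$ is minimized with

$$a = \begin{cases} 0, & \gamma < 1, \\ 1, & \gamma > 1, \end{cases} \tag{23}$$

in the limit $\epsilon_1, \epsilon_2 \to 0$. I then obtain

$$V = \begin{cases} \gamma \bar{v}, & \gamma < 1, \\ \bar{v}, & \gamma > 1, \end{cases} \tag{24}$$

and

$$\sigma_y^2 = \frac{1}{N-1} \begin{cases} \gamma^2 \sigma^2 + \epsilon_2^2, & \gamma < 1, \\ \sigma^2 + \epsilon_1^2, & \gamma > 1. \end{cases} \tag{25}$$

Thus, the systematic error is not amplified with this procedure, and the estimate of $\sigma_y$
is not artificially small.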
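The constrained best estimator for two measurements can be sketched as follows. The function name `best_estimator_two_points` and the numerical parameters are assumptions for illustration. Because the variance of $y = a y_1 + (1-a) y_2$ is a convex quadratic in $a$, clipping the unconstrained minimizer to the interval $[0, 1]$ gives the constrained minimum.

```python
import numpy as np

def best_estimator_two_points(y, cov):
    """Minimum-variance y = a*y1 + (1-a)*y2 subject to 0 <= a <= 1.

    Returns (a, estimate, variance of the estimate).
    """
    c11, c12, c22 = cov[0, 0], cov[0, 1], cov[1, 1]
    a_free = (c22 - c12) / (c11 + c22 - 2.0 * c12)   # unconstrained minimizer
    a = float(np.clip(a_free, 0.0, 1.0))             # enforce a >= 0 and 1 - a >= 0
    estimate = a * y[0] + (1.0 - a) * y[1]
    variance = a**2 * c11 + 2.0 * a * (1.0 - a) * c12 + (1.0 - a)**2 * c22
    return a, float(estimate), float(variance)

# Assumed illustrative parameters: 1% discharge, large fluctuations, small noise.
gamma, sigma, eps, v_mean, n_trials = 0.99, 10.0, 0.001, 1.0, 11
cov = np.array([[sigma**2 + eps**2, gamma * sigma**2],
                [gamma * sigma**2, gamma**2 * sigma**2 + eps**2]]) / (n_trials - 1)
y = np.array([v_mean, gamma * v_mean])

a, V_best, var_best = best_estimator_two_points(y, cov)
print("a =", a, " V =", V_best, " variance =", var_best)
```

For $\gamma < 1$ and small noise the weight is clipped to $a = 0$, so the estimate is simply the second measurement and its variance is that measurement's own variance rather than an artificially small one.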

The crucial point is the non-negativity of $a$ and $1 - a$. Mathematically, this can be
written as

$$\forall i: \quad \frac{\partial f_i}{\partial \bar{y}_i} \geq 0. \tag{26}$$

This general requirement for a stable fit is that, given a perturbation in the data, the 
function does not move locally against the direction of the perturbation. It is 
intuitively obvious, though I am not sure whether it has been rigorously demonstrated. 

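The partial derivative is calculated as follows. The general fitting condition of
minimizing $\chi^2$ can be written as

$$\forall i: \quad \sum_{j,k} (C^{-1})_{jk}\, \frac{\partial f_j}{\partial a_i}\, (f_k - \bar{y}_k) = 0, \tag{27}$$

where $\{a\}$ is the set of fitting parameters. If $\bar{y}_i \to \bar{y}_i + \delta \bar{y}_i$, we must now have

$$\sum_{j,k,l} (C^{-1})_{jk} \left\{ \frac{\partial f_j}{\partial a_i} \frac{\partial f_k}{\partial a_l} + \frac{\partial^2 f_j}{\partial a_i \partial a_l} (f_k - \bar{y}_k) \right\} \delta a_l - \sum_{j,k} (C^{-1})_{jk}\, \frac{\partial f_j}{\partial a_i}\, \delta \bar{y}_k = 0. \tag{28}$$

This can be written more compactly in matrix form:

$$\delta a_k = \sum_{l,m} (M^{-1})_{kl}\, K_{lm}\, \delta \bar{y}_m, \tag{29}$$

$$M_{kl} = \sum_{i,j} (C^{-1})_{ij} \left\{ \frac{\partial f_i}{\partial a_k} \frac{\partial f_j}{\partial a_l} + \frac{\partial^2 f_i}{\partial a_k \partial a_l} (f_j - \bar{y}_j) \right\}, \tag{30}$$

$$K_{lm} = \sum_{i} (C^{-1})_{im}\, \frac{\partial f_i}{\partial a_l}. \tag{31}$$

Finally, I obtain

$$\delta f_i = \sum_k \frac{\partial f_i}{\partial a_k}\, \delta a_k, \tag{32}$$

and the partial derivative is

$$\frac{\partial f_i}{\partial \bar{y}_i} = \sum_{k,l} \frac{\partial f_i}{\partial a_k}\, (M^{-1})_{kl}\, K_{li}. \tag{34}$$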

For the fit to a constant $V$, $\partial f_i/\partial V = 1$, so

$$\frac{\partial f_i}{\partial \bar{y}_i} = \frac{\sum_j (C^{-1})_{ij}}{\sum_{j,k} (C^{-1})_{jk}}. \tag{35}$$

The denominator is never negative, as it is equal to a sum of eigenvalues of $C^{-1}$ (with
all weights non-negative), and all eigenvalues of $C^{-1}$ are non-negative. Thus, the
stability condition for this regression is

$$\forall i: \quad \sum_j (C^{-1})_{ij} \geq 0, \tag{36}$$

which is trivially satisfied for uncorrelated data ($C$ is then diagonal). If this condition
is violated, then the best estimator should be used instead of the regression, to obtain
the variance in the fit parameters.
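For the fit to a constant, the stability condition of Eq. (36) amounts to checking the signs of the row sums of the inverse covariance matrix. A minimal sketch follows; the function name `is_stable_constant_fit` and the example matrices are assumptions for illustration, with the overall $1/(N-1)$ factor omitted because it does not affect the signs.

```python
import numpy as np

def is_stable_constant_fit(cov):
    """Check that sum_j (C^-1)_ij >= 0 for every i (fit to a constant).

    Returns (stable, row_sums). C is symmetric, so solving C w = 1
    gives the row sums of C^-1 without forming the inverse.
    """
    row_sums = np.linalg.solve(cov, np.ones(cov.shape[0]))
    return bool(np.all(row_sums >= 0.0)), row_sums

# Uncorrelated data: diagonal covariance matrix.
stable_diag, _ = is_stable_constant_fit(np.diag([1.0, 4.0]))

# Correlated data without discharge (assumed sigma^2 = 4, eps_1 = 1, eps_2 = 2).
cov_first = np.array([[5.0, 4.0], [4.0, 8.0]])
stable_first, _ = is_stable_constant_fit(cov_first)

# Correlated data with a 1% discharge (assumed gamma = 0.99, sigma = 10, eps = 0.001).
gamma, sigma, eps = 0.99, 10.0, 0.001
cov_second = np.array([[sigma**2 + eps**2, gamma * sigma**2],
                       [gamma * sigma**2, gamma**2 * sigma**2 + eps**2]])
stable_second, sums_second = is_stable_constant_fit(cov_second)

print(stable_diag, stable_first, stable_second)
print(sums_second)
```

A negative row sum means that one data point enters the fitted constant with a negative weight, which is the signal to fall back on the best estimator.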

The best estimator technique can also be used to fit lines and more complicated
curves to data. For a line, first fit $y = ax + b$ to all independent sets of points $ij$ to
obtain

$$a_{ij} = \frac{y_i - y_j}{x_i - x_j}, \tag{37}$$

$$b_{ij} = y_i - a_{ij} x_i. \tag{38}$$

Then construct linear estimators for the quantities $a$ and $b$,

$$a = \sum_{ij} k_{ij}\, a_{ij}, \tag{39}$$

$$b = \sum_{ij} k'_{ij}\, b_{ij}, \tag{40}$$

with the constraints

$$\sum_{ij} k_{ij} = \sum_{ij} k'_{ij} = 1. \tag{41}$$

Finally, minimize the variance in $a$ and $b$, to obtain the values and variances of both,
but with the conditions

$$\forall ij: \quad k_{ij}, k'_{ij} \geq 0. \tag{42}$$
In general the procedure is not worth the effort required, as the fit parameters are
identical to those obtained with $\chi^2$ regression.
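The first step of the line fit described above, the two-point slopes and intercepts, can be sketched as follows. The function name `pairwise_line_estimates` is hypothetical, the sketch enumerates every pair $i < j$ rather than selecting an independent subset, and it does not perform the constrained variance minimization over the weights.

```python
from itertools import combinations

import numpy as np

def pairwise_line_estimates(x, y):
    """Slope and intercept of the line through each pair of points.

    Returns {(i, j): (a_ij, b_ij)} for i < j; assumes all x values are distinct.
    """
    estimates = {}
    for i, j in combinations(range(len(x)), 2):
        a_ij = (y[i] - y[j]) / (x[i] - x[j])     # slope through points i and j
        b_ij = y[i] - a_ij * x[i]                # intercept of the same line
        estimates[(i, j)] = (a_ij, b_ij)
    return estimates

# Made-up points lying exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 3.0])
y = np.array([1.0, 3.0, 7.0])
estimates = pairwise_line_estimates(x, y)
for pair, (a_ij, b_ij) in estimates.items():
    print(pair, a_ij, b_ij)
```

The best estimators would then be the non-negative, unit-sum combinations of these pairwise values with the smallest variance, which requires the covariances of the pairwise estimates over the $N$ trials.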

I have shown that covariance matrix regression should be supplemented by a test for
the stability of the regression. When the regression is unstable, the fit parameters can
be altered in an uncontrolled fashion. These alterations can sometimes be ruled out by
a $\chi^2$ test; however, for arbitrarily small $\chi^2$, if the data set is small and fluctuations are
large, the apparent errors in fit parameters can be much smaller than the difference
between their apparent values and the best estimators for these values.

The alternative to using covariance matrix regression is to fit all possible sets of 
points (as many points per set as there are fit parameters) to obtain all possible 
linearly independent sets of the fit parameters, and use the linear combinations of the 
values obtained in this way (with no negative multipliers) that have the lowest 
variances as the best estimators of the fit parameters. This is computationally more 
cumbersome, but is the more rigorous procedure so it may be simplest to use this in 
the first place rather than attempting covariance matrix regression first. 

I thank F. James for helpful suggestions and K. Zalewski for useful discussions. This 
material is based upon work supported by the North Atlantic Treaty Organization 
under a Grant awarded in 1991. 